Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service would lead to losses for the bank, so the bank wants to analyze its customer data to identify the customers who will leave its credit card services, and the reasons why, so that it can improve in those areas.
As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
This is a commented Jupyter (IPython) notebook in which all the instructions and tasks to be performed are described.
# Read and manipulate data
import pandas as pd
import numpy as np
# Data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# Missing value imputation
from sklearn.impute import SimpleImputer
# Model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
from sklearn.dummy import DummyClassifier
# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
ConfusionMatrixDisplay,
RocCurveDisplay,
)
# Data scaling and encoding
from sklearn.preprocessing import (
StandardScaler,
MinMaxScaler,
OneHotEncoder,
RobustScaler,
)
# Tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# Creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import TransformerMixin
# Oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To suppress scientific notation for dataframes
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# Set the background for the graphs
plt.style.use("ggplot")
# Printing style
!pip install tabulate
from tabulate import tabulate
# To suppress warnings
import warnings
# Date time
from datetime import datetime
warnings.filterwarnings("ignore")
Requirement already satisfied: tabulate in /usr/local/lib/python3.10/dist-packages (0.9.0)
# Run the following lines for Google Colab to mount your Google drive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
# Load the dataset
original = pd.read_csv('/content/drive/My Drive/Advanced_Machine_Learning/BankChurners.csv')
# Copy the dataset to avoid changing the original
data = original.copy()
# Display basic information and first 5 rows of the dataset
data.info(), data.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CLIENTNUM                 10127 non-null  int64
 1   Attrition_Flag            10127 non-null  object
 2   Customer_Age              10127 non-null  int64
 3   Gender                    10127 non-null  object
 4   Dependent_count           10127 non-null  int64
 5   Education_Level           8608 non-null   object
 6   Marital_Status            9378 non-null   object
 7   Income_Category           10127 non-null  object
 8   Card_Category             10127 non-null  object
 9   Months_on_book            10127 non-null  int64
 10  Total_Relationship_Count  10127 non-null  int64
 11  Months_Inactive_12_mon    10127 non-null  int64
 12  Contacts_Count_12_mon     10127 non-null  int64
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64
 18  Total_Trans_Ct            10127 non-null  int64
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
(None,
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count \
0 768805383 Existing Customer 45 M 3
1 818770008 Existing Customer 49 F 5
2 713982108 Existing Customer 51 M 3
3 769911858 Existing Customer 40 F 4
4 709106358 Existing Customer 40 M 3
Education_Level Marital_Status Income_Category Card_Category \
0 High School Married $60K - $80K Blue
1 Graduate Single Less than $40K Blue
2 Graduate Married $80K - $120K Blue
3 High School NaN Less than $40K Blue
4 Uneducated Married $60K - $80K Blue
Months_on_book Total_Relationship_Count Months_Inactive_12_mon \
0 39 5 1
1 44 6 1
2 36 4 1
3 34 3 4
4 21 5 1
Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy \
0 3 12691.000 777 11914.000
1 2 8256.000 864 7392.000
2 0 3418.000 0 3418.000
3 1 3313.000 2517 796.000
4 0 4716.000 0 4716.000
Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 \
0 1.335 1144 42 1.625
1 1.541 1291 33 3.714
2 2.594 1887 20 2.333
3 1.405 1171 20 2.333
4 2.175 816 28 2.500
Avg_Utilization_Ratio
0 0.061
1 0.105
2 0.000
3 0.760
4 0.000 )
# Check for missing values
missing_values = data.isnull().sum()
missing_values = missing_values[missing_values > 0]
# Display the number of missing values for each feature
missing_values
Education_Level    1519
Marital_Status      749
dtype: int64
# Check for duplicated records
duplicated_records = data.duplicated()
num_duplicated_records = duplicated_records.sum()
# Display the number of duplicated records
num_duplicated_records
0
# Get the statistical summary of the numerical columns
numerical_summary = data.describe().transpose()
# Display the statistical summary
numerical_summary
|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CLIENTNUM | 10127.000 | 739177606.334 | 36903783.450 | 708082083.000 | 713036770.500 | 717926358.000 | 773143533.000 | 828343083.000 |
| Customer_Age | 10127.000 | 46.326 | 8.017 | 26.000 | 41.000 | 46.000 | 52.000 | 73.000 |
| Dependent_count | 10127.000 | 2.346 | 1.299 | 0.000 | 1.000 | 2.000 | 3.000 | 5.000 |
| Months_on_book | 10127.000 | 35.928 | 7.986 | 13.000 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.000 | 3.813 | 1.554 | 1.000 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.000 | 2.341 | 1.011 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.000 | 2.455 | 1.106 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.000 | 8631.954 | 9088.777 | 1438.300 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.000 | 1162.814 | 814.987 | 0.000 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.000 | 7469.140 | 9090.685 | 3.000 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.000 | 0.760 | 0.219 | 0.000 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.000 | 4404.086 | 3397.129 | 510.000 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.000 | 64.859 | 23.473 | 10.000 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.000 | 0.712 | 0.238 | 0.000 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.000 | 0.275 | 0.276 | 0.000 | 0.023 | 0.176 | 0.503 | 0.999 |
The table above gives the statistical summary of the numerical columns in the dataset.
# Get the statistical summary of the categorical columns
categorical_summary = data.describe(include=['object']).transpose()
# Display the statistical summary
categorical_summary
|  | count | unique | top | freq |
|---|---|---|---|---|
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 |
| Gender | 10127 | 2 | F | 5358 |
| Education_Level | 8608 | 6 | Graduate | 3128 |
| Marital_Status | 9378 | 3 | Married | 4687 |
| Income_Category | 10127 | 6 | Less than $40K | 3561 |
| Card_Category | 10127 | 4 | Blue | 9436 |
The table above gives the statistical summary of the categorical columns in the dataset.
# Below function prints unique value counts and percentages for the category/object type variables
def category_unique_value():
for cat_cols in (
data.select_dtypes(exclude=[np.int64, np.float64]).columns.unique().to_list()
):
print("Unique values and corresponding data counts for feature: " + cat_cols)
print("-" * 90)
df_temp = pd.concat(
[
data[cat_cols].value_counts(),
data[cat_cols].value_counts(normalize=True) * 100,
],
axis=1,
)
df_temp.columns = ["Count", "Percentage"]
print(df_temp)
print("-" * 90)
# Display the unique value counts and percentages for the category/object type variables
category_unique_value()
Unique values and corresponding data counts for feature: Attrition_Flag
------------------------------------------------------------------------------------------
Count Percentage
Existing Customer 8500 83.934
Attrited Customer 1627 16.066
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Gender
------------------------------------------------------------------------------------------
Count Percentage
F 5358 52.908
M 4769 47.092
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Education_Level
------------------------------------------------------------------------------------------
Count Percentage
Graduate 3128 36.338
High School 2013 23.385
Uneducated 1487 17.275
College 1013 11.768
Post-Graduate 516 5.994
Doctorate 451 5.239
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Marital_Status
------------------------------------------------------------------------------------------
Count Percentage
Married 4687 49.979
Single 3943 42.045
Divorced 748 7.976
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Income_Category
------------------------------------------------------------------------------------------
Count Percentage
Less than $40K 3561 35.163
$40K - $60K 1790 17.676
$80K - $120K 1535 15.157
$60K - $80K 1402 13.844
abc 1112 10.981
$120K + 727 7.179
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Card_Category
------------------------------------------------------------------------------------------
Count Percentage
Blue 9436 93.177
Silver 555 5.480
Gold 116 1.145
Platinum 20 0.197
------------------------------------------------------------------------------------------
Questions:
1. How is the total transaction amount (Total_Trans_Amt) distributed?
2. What is the distribution of the level of education of customers (Education_Level)?
3. What is the distribution of the level of income of customers (Income_Category)?
4. How does the change in transaction count between Q4 and Q1 (Total_Ct_Chng_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
5. How does the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
Note: Answers are provided at the end of the EDA.
# Function to plot a boxplot and a histogram along the same scale
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# List of numerical variables to visualize
numerical_vars = [
'Customer_Age', 'Dependent_count', 'Months_on_book', 'Total_Relationship_Count',
'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit',
'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1',
'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'
]
# Visualize each numerical variable
for feature in numerical_vars:
histogram_boxplot(data, feature, figsize=(12, 7), kde=True, bins=None)
# Function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
- data (DataFrame): The data frame containing the categorical feature.
- feature (str): The name of the categorical feature.
- perc (bool): Whether to print percentages instead of counts.
- n (int): The number of categories to print (None for all).
"""
total = len(data[feature]) # length of the column
count = data[feature].value_counts().iloc[:n] # count per category
# Plot
plt.figure(figsize=(12, 6))
sns.countplot(data=data, x=feature, order=count.index)
plt.title(f'Distribution of {feature}')
plt.xticks(rotation=45)
# Print count/percentage
for i, key in enumerate(count.index):
val = count.iloc[i]
pct = 100 * val / total
if perc:
plt.text(i, val, f'{pct:.2f}%', ha='center', va='bottom')
else:
plt.text(i, val, f'{val}', ha='center', va='bottom')
plt.tight_layout()
plt.show()
# Call the function for each specified feature
categorical_features = [
'Attrition_Flag', 'Gender', 'Education_Level',
'Marital_Status', 'Income_Category', 'Card_Category'
]
for feature in categorical_features:
labeled_barplot(data, feature, perc=True, n=None)
The visualizations provide insights into the distribution of key variables:
Imbalance in Attrition_Flag: The significant imbalance in the 'Attrition_Flag' feature (target variable) might require addressing during model training to avoid bias towards the majority class.
Unknown Categories: The presence of "Unknown" categories in 'Education_Level' and 'Marital_Status' might indicate missing data and should be considered during data preprocessing.
Card_Category Imbalance: The 'Card_Category' feature is highly imbalanced and might have limited predictive power in its current form. Feature engineering or alternative handling might be considered.
Ordinal Encoding: Features like 'Education_Level' and 'Income_Category' might benefit from ordinal encoding due to the inherent order in their categories.
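The ordinal-encoding idea mentioned above can be sketched as follows. This is a minimal, hypothetical mapping (not from the notebook): the exact rank assigned to each category, and placing 'Unknown' at the bottom, are assumptions that would need validating against the business context.

```python
import pandas as pd

# Hypothetical ordered mappings; ranks and the placement of 'Unknown'/'abc'
# at the bottom are assumptions for illustration only
education_order = {
    "Unknown": 0, "Uneducated": 1, "High School": 2, "College": 3,
    "Graduate": 4, "Post-Graduate": 5, "Doctorate": 6,
}
income_order = {
    "abc": 0, "Less than $40K": 1, "$40K - $60K": 2,
    "$60K - $80K": 3, "$80K - $120K": 4, "$120K +": 5,
}

# Toy frame standing in for the real columns
df = pd.DataFrame({
    "Education_Level": ["Graduate", "Doctorate", "Unknown"],
    "Income_Category": ["Less than $40K", "$120K +", "abc"],
})
df["Education_Level_Ord"] = df["Education_Level"].map(education_order)
df["Income_Category_Ord"] = df["Income_Category"].map(income_order)
```

`Series.map` returns NaN for any category missing from the mapping, which makes unmapped values easy to spot after encoding.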
# Function to plot distributions
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
numerical_vars = [
'Customer_Age', 'Dependent_count', 'Months_on_book', 'Total_Relationship_Count',
'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit',
'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1',
'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'
]
target = 'Attrition_Flag'
for var in numerical_vars:
distribution_plot_wrt_target(data, var, target)
The visualizations provide insights into the relationships between the target variable and the numerical features.
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1)) # place the legend outside the plot
plt.show()
# List of categorical predictors
categorical_predictors = ['Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category']
# Target variable
target = 'Attrition_Flag'
# Call the function for each predictor
for predictor in categorical_predictors:
print(f'\n\nStacked Barplot and Counts for: {predictor}\n')
stacked_barplot(data, predictor, target)
Stacked Barplot and Counts for: Gender

Attrition_Flag   Attrited Customer  Existing Customer    All
Gender
All                           1627               8500  10127
F                              930               4428   5358
M                              697               4072   4769
------------------------------------------------------------------------------------------------------------------------

Stacked Barplot and Counts for: Education_Level

Attrition_Flag   Attrited Customer  Existing Customer   All
Education_Level
All                           1371               7237  8608
Graduate                       487               2641  3128
High School                    306               1707  2013
Uneducated                     237               1250  1487
College                        154                859  1013
Doctorate                       95                356   451
Post-Graduate                   92                424   516
------------------------------------------------------------------------------------------------------------------------

Stacked Barplot and Counts for: Marital_Status

Attrition_Flag   Attrited Customer  Existing Customer   All
Marital_Status
All                           1498               7880  9378
Married                        709               3978  4687
Single                         668               3275  3943
Divorced                       121                627   748
------------------------------------------------------------------------------------------------------------------------

Stacked Barplot and Counts for: Income_Category

Attrition_Flag   Attrited Customer  Existing Customer    All
Income_Category
All                           1627               8500  10127
Less than $40K                 612               2949   3561
$40K - $60K                    271               1519   1790
$80K - $120K                   242               1293   1535
$60K - $80K                    189               1213   1402
abc                            187                925   1112
$120K +                        126                601    727
------------------------------------------------------------------------------------------------------------------------

Stacked Barplot and Counts for: Card_Category

Attrition_Flag   Attrited Customer  Existing Customer    All
Card_Category
All                           1627               8500  10127
Blue                          1519               7917   9436
Silver                          82                473    555
Gold                            21                 95    116
Platinum                         5                 15     20
------------------------------------------------------------------------------------------------------------------------
The visualizations provide insights into the relationships between the target variable and the categorical features.
# Setting the figure size and color palette
plt.figure(figsize=(20, 20))
sns.set(palette='nipy_spectral')
# Creating a pairplot for the data with hue as 'Attrition_Flag'
sns.pairplot(data=data, hue='Attrition_Flag', corner=True)
Since the pairplot is quite extensive and involves multiple variables, providing detailed observations for each pair might be overwhelming.
# Sample Multivariate Analysis: Interactions between different independent variables
plt.figure(figsize=(10, 8))
sns.scatterplot(x='Total_Trans_Amt', y='Total_Trans_Ct', hue='Attrition_Flag', data=data, alpha=0.6)
plt.title('Total Transaction Amount vs. Total Transaction Count colored by Attrition Flag')
plt.show()
The scatterplot shows how total transaction amount and total transaction count interact for existing versus attrited customers.
# Set the aesthetic style of the plots
sns.set_style('whitegrid')
# Question 1: How is the total transaction amount distributed?
plt.figure(figsize=(10, 6))
sns.histplot(data['Total_Trans_Amt'], bins=30, kde=True)
plt.title('Distribution of Total Transaction Amount')
plt.xlabel('Total Transaction Amount')
plt.ylabel('Frequency')
plt.show()
# Set the aesthetic style of the plots
sns.set_style('whitegrid')
# Question 2: What is the distribution of the level of education of customers?
plt.figure(figsize=(10, 6))
sns.countplot(y=data['Education_Level'], order=data['Education_Level'].value_counts().index)
plt.title('Distribution of Education Level of Customers')
plt.xlabel('Count')
plt.ylabel('Education Level')
plt.show()
# Set the aesthetic style of the plots
sns.set_style('whitegrid')
# Question 3: What is the distribution of the level of income of customers?
plt.figure(figsize=(10, 6))
sns.countplot(y=data['Income_Category'], order=data['Income_Category'].value_counts().index)
plt.title('Distribution of Income Level of Customers')
plt.xlabel('Count')
plt.ylabel('Income Category')
plt.show()
# Question 4: How does the change in transaction count between Q4 and Q1 (Total_Ct_Chng_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
plt.figure(figsize=(10, 6))
sns.boxplot(x=data['Attrition_Flag'], y=data['Total_Ct_Chng_Q4_Q1'])
plt.title('Variation of Change in Transaction Count (Q4 to Q1) by Account Status')
plt.xlabel('Attrition Flag')
plt.ylabel('Change in Transaction Count (Q4 to Q1)')
plt.show()
# Question 5: How does the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
plt.figure(figsize=(10, 6))
sns.boxplot(x=data['Attrition_Flag'], y=data['Months_Inactive_12_mon'])
plt.title('Variation of Inactive Months in Last 12 Months by Account Status')
plt.xlabel('Attrition Flag')
plt.ylabel('Inactive Months in Last 12 Months')
plt.show()
# Correlation Analysis
correlation_matrix = data.corr(numeric_only=True)  # restrict to numeric columns (object columns would raise an error in newer pandas)
plt.figure(figsize=(14, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=.5)
plt.title('Correlation Matrix of Variables')
plt.show()
Correlation Matrix: The heatmap shows the correlations between the numerical variables. Some notable correlations include:
Positive Correlations: Credit_Limit and Avg_Open_To_Buy are almost perfectly correlated (open-to-buy is essentially the credit limit minus the revolving balance), and Total_Trans_Amt rises with Total_Trans_Ct.
Negative Correlations: Avg_Utilization_Ratio tends to decrease as Credit_Limit and Avg_Open_To_Buy increase.
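Reading the strongest relationships off a large heatmap can be error-prone; the same information can be extracted programmatically. The helper below is a hypothetical sketch (not part of the notebook) that ranks variable pairs by absolute correlation:

```python
import numpy as np
import pandas as pd

# Hypothetical helper to rank the strongest absolute pairwise correlations
def top_correlations(df, n=5):
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair is counted once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    return corr.where(mask).stack().sort_values(ascending=False).head(n)

# Toy frame: 'b' is an exact multiple of 'a', so that pair should rank first
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, 0, 0, 1]})
print(top_correlations(toy, n=1))
```

Applied to `data`, this would surface pairs such as Credit_Limit/Avg_Open_To_Buy without scanning the heatmap by eye.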
# Checking for missing values in the dataset
missing_values = data.isnull().sum()
missing_values = missing_values[missing_values > 0].sort_values(ascending=False)
# Displaying the count of missing values for each variable that has them
missing_values
Education_Level    1519
Marital_Status      749
dtype: int64
# Replacing missing values with 'Unknown' for 'Education_Level' and 'Marital_Status'
data['Education_Level'] = data['Education_Level'].fillna('Unknown')
data['Marital_Status'] = data['Marital_Status'].fillna('Unknown')
# Verifying that there are no more missing values
missing_values_after = data[['Education_Level', 'Marital_Status']].isnull().sum()
missing_values_after
Education_Level    0
Marital_Status     0
dtype: int64
# Encoding Categorical Variables
# One-hot encoding for categorical variables
categorical_vars = ['Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category']
data_encoded = pd.get_dummies(data, columns=categorical_vars, drop_first=True)
# Binary encoding for the target variable 'Attrition_Flag'
data_encoded['Attrition_Flag'] = data_encoded['Attrition_Flag'].map({'Existing Customer': 0, 'Attrited Customer': 1})
# Displaying the first few rows of the encoded data
data_encoded.head()
|  | CLIENTNUM | Attrition_Flag | Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Gender_M | Education_Level_Doctorate | Education_Level_Graduate | Education_Level_High School | Education_Level_Post-Graduate | Education_Level_Uneducated | Education_Level_Unknown | Marital_Status_Married | Marital_Status_Single | Marital_Status_Unknown | Income_Category_$40K - $60K | Income_Category_$60K - $80K | Income_Category_$80K - $120K | Income_Category_Less than $40K | Income_Category_abc | Card_Category_Gold | Card_Category_Platinum | Card_Category_Silver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | 0 | 45 | 3 | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 818770008 | 0 | 49 | 5 | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 713982108 | 0 | 51 | 3 | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 769911858 | 0 | 40 | 4 | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 709106358 | 0 | 40 | 3 | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
data_encoded.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 34 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   CLIENTNUM                       10127 non-null  int64
 1   Attrition_Flag                  10127 non-null  int64
 2   Customer_Age                    10127 non-null  int64
 3   Dependent_count                 10127 non-null  int64
 4   Months_on_book                  10127 non-null  int64
 5   Total_Relationship_Count        10127 non-null  int64
 6   Months_Inactive_12_mon          10127 non-null  int64
 7   Contacts_Count_12_mon           10127 non-null  int64
 8   Credit_Limit                    10127 non-null  float64
 9   Total_Revolving_Bal             10127 non-null  int64
 10  Avg_Open_To_Buy                 10127 non-null  float64
 11  Total_Amt_Chng_Q4_Q1            10127 non-null  float64
 12  Total_Trans_Amt                 10127 non-null  int64
 13  Total_Trans_Ct                  10127 non-null  int64
 14  Total_Ct_Chng_Q4_Q1             10127 non-null  float64
 15  Avg_Utilization_Ratio           10127 non-null  float64
 16  Gender_M                        10127 non-null  uint8
 17  Education_Level_Doctorate       10127 non-null  uint8
 18  Education_Level_Graduate        10127 non-null  uint8
 19  Education_Level_High School     10127 non-null  uint8
 20  Education_Level_Post-Graduate   10127 non-null  uint8
 21  Education_Level_Uneducated      10127 non-null  uint8
 22  Education_Level_Unknown         10127 non-null  uint8
 23  Marital_Status_Married          10127 non-null  uint8
 24  Marital_Status_Single           10127 non-null  uint8
 25  Marital_Status_Unknown          10127 non-null  uint8
 26  Income_Category_$40K - $60K     10127 non-null  uint8
 27  Income_Category_$60K - $80K     10127 non-null  uint8
 28  Income_Category_$80K - $120K    10127 non-null  uint8
 29  Income_Category_Less than $40K  10127 non-null  uint8
 30  Income_Category_abc             10127 non-null  uint8
 31  Card_Category_Gold              10127 non-null  uint8
 32  Card_Category_Platinum          10127 non-null  uint8
 33  Card_Category_Silver            10127 non-null  uint8
dtypes: float64(5), int64(11), uint8(18)
memory usage: 1.4 MB
One-Hot Encoding was applied to the following categorical variables: Gender, Education_Level, Marital_Status, Income_Category, and Card_Category.
This process created binary (0 or 1) columns for each category in the original columns. For example, Education_Level which had categories like 'Graduate', 'High School', etc., has been transformed into separate columns like Education_Level_Graduate, Education_Level_High School, each indicating the presence or absence of the category with 1 or 0 respectively.
Binary Encoding was applied to the target variable Attrition_Flag: 'Existing Customer' → 0, 'Attrited Customer' → 1.
Now, our dataset data_encoded has 34 columns, where the categorical variables have been transformed into a format suitable for modeling.
# List of numerical variables
numerical_vars = [
'Customer_Age', 'Dependent_count', 'Months_on_book', 'Total_Relationship_Count',
'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit',
'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1',
'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'
]
# Plotting boxplots for numerical variables to visualize outliers
plt.figure(figsize=(20, 10))
for i, var in enumerate(numerical_vars):
plt.subplot(3, 5, i+1)
sns.boxplot(y=var, data=data_encoded)
plt.title(f'Boxplot of {var}')
plt.tight_layout()
plt.show()
# List of numerical variables
numerical_vars = [
'Customer_Age', 'Dependent_count', 'Months_on_book', 'Total_Relationship_Count',
'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit',
'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1',
'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'
]
# Applying capping to the numerical variables
for var in numerical_vars:
lower_bound = data_encoded[var].quantile(0.01) # 1st percentile
upper_bound = data_encoded[var].quantile(0.99) # 99th percentile
data_encoded[var] = data_encoded[var].clip(lower=lower_bound, upper=upper_bound)
# Verifying the capping by visualizing boxplots again
plt.figure(figsize=(20, 10))
for i, var in enumerate(numerical_vars):
plt.subplot(3, 5, i+1)
sns.boxplot(y=var, data=data_encoded)
plt.title(f'Boxplot of {var}')
plt.tight_layout()
plt.show()
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Assuming data_encoded is your pre-processed data
# Defining the target variable and predictors
# Note: CLIENTNUM is a customer identifier with no predictive value; consider dropping it as well before modeling
X = data_encoded.drop(columns=['Attrition_Flag'])
y = data_encoded['Attrition_Flag']
# Splitting the data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Initializing the Standard Scaler
scaler = StandardScaler()
# Fitting the scaler on the training data and transforming both training and testing data
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
# Displaying the first few rows of the scaled training data
X_train_scaled.head()
|  | CLIENTNUM | Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Gender_M | Education_Level_Doctorate | Education_Level_Graduate | Education_Level_High School | Education_Level_Post-Graduate | Education_Level_Uneducated | Education_Level_Unknown | Marital_Status_Married | Marital_Status_Single | Marital_Status_Unknown | Income_Category_$40K - $60K | Income_Category_$60K - $80K | Income_Category_$80K - $120K | Income_Category_Less than $40K | Income_Category_abc | Card_Category_Gold | Card_Category_Platinum | Card_Category_Silver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.318 | -1.306 | -1.810 | -1.523 | 0.122 | 0.647 | -0.412 | -0.665 | 1.304 | -0.784 | -0.354 | -0.779 | -0.969 | -1.229 | 2.157 | -0.947 | -0.214 | -0.665 | -0.506 | -0.236 | -0.414 | 2.390 | 1.076 | -0.796 | -0.284 | -0.467 | -0.404 | -0.419 | 1.364 | -0.350 | -0.108 | -0.046 | -0.237 |
| 1 | -0.560 | -0.301 | 0.502 | -0.003 | 0.765 | -0.347 | 0.501 | 1.849 | -1.433 | 1.980 | 0.452 | -0.623 | -1.139 | -0.593 | -1.002 | 1.056 | -0.214 | -0.665 | 1.977 | -0.236 | -0.414 | -0.418 | -0.930 | 1.256 | -0.284 | -0.467 | 2.476 | -0.419 | -0.733 | -0.350 | -0.108 | -0.046 | 4.219 |
| 2 | 0.879 | -0.050 | -0.269 | -0.763 | 1.409 | 0.647 | -1.325 | 0.342 | -0.310 | 0.371 | 0.665 | -0.032 | 1.036 | 0.683 | -0.718 | 1.056 | -0.214 | -0.665 | 1.977 | -0.236 | -0.414 | -0.418 | 1.076 | -0.796 | -0.284 | -0.467 | -0.404 | 2.388 | -0.733 | -0.350 | -0.108 | -0.046 | -0.237 |
| 3 | -0.662 | -1.306 | -0.269 | -1.523 | -0.521 | -1.340 | 0.501 | -0.604 | 0.522 | -0.652 | 0.551 | -0.809 | -1.012 | -1.610 | 0.854 | -0.947 | -0.214 | 1.505 | -0.506 | -0.236 | -0.414 | -0.418 | 1.076 | -0.796 | -0.284 | -0.467 | -0.404 | -0.419 | 1.364 | -0.350 | -0.108 | -0.046 | -0.237 |
| 4 | -0.579 | 0.453 | -1.039 | 0.504 | 0.122 | -0.347 | 0.501 | 2.872 | 0.021 | 2.872 | -0.161 | -0.151 | 0.311 | 0.024 | -0.878 | 1.056 | -0.214 | 1.505 | -0.506 | -0.236 | -0.414 | -0.418 | -0.930 | 1.256 | -0.284 | -0.467 | -0.404 | -0.419 | -0.733 | -0.350 | -0.108 | -0.046 | -0.237 |
The nature of predictions made by the classification model will translate as follows:
- A false negative (predicting that a customer will stay when they actually leave) means the bank loses a customer it could have tried to retain; this is the costlier error here.
- A false positive (predicting churn for a customer who would have stayed) only costs some unnecessary retention effort.
Which metric to optimize?
Since false negatives are the more expensive mistake, we will prioritize Recall, while also tracking Accuracy, Precision, and F1-score.
Let's define a function to output these metrics (including recall) on the train and test sets, so that we do not have to repeat the same code while evaluating models; confusion matrices will be displayed with ConfusionMatrixDisplay.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
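As a quick sanity check on why accuracy alone would be misleading here, a small sketch (on synthetic data, not the bank's) of a majority-class baseline using sklearn's DummyClassifier: on an imbalanced target it scores high accuracy while catching no churners at all.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(1)
X_demo = pd.DataFrame({"f1": rng.normal(size=200), "f2": rng.normal(size=200)})
y_demo = pd.Series([1] * 30 + [0] * 170)  # ~15% positive class, similar to churn

# Baseline that always predicts the majority class (non-churn)
dummy = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
pred = dummy.predict(X_demo)
print(accuracy_score(y_demo, pred))  # 0.85 -- looks deceptively good
print(recall_score(y_demo, pred))    # 0.0  -- catches no churners
```

This is exactly the failure mode the Recall metric guards against.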
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
import xgboost as xgb
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
import matplotlib.pyplot as plt
# Defining the models
dt = DecisionTreeClassifier(random_state=1)
bag = BaggingClassifier(random_state=1)
rf = RandomForestClassifier(random_state=1)
ada = AdaBoostClassifier(random_state=1)
gb = GradientBoostingClassifier(random_state=1)
xg = xgb.XGBClassifier(random_state=1, use_label_encoder=False, eval_metric='logloss')
# Appending the models to a list
models = []
models.append(('DecisionTree', dt))
models.append(('Bagging', bag))
models.append(('RandomForest', rf))
models.append(('AdaBoost', ada))
models.append(('GradientBoost', gb))
models.append(('XGBoost', xg))
# Looping through models to train, predict, and evaluate
for name, model in models:
# Fitting the model
model.fit(X_train, y_train)
# Evaluating and displaying metrics on training data
perf_train = model_performance_classification_sklearn(model, X_train, y_train)
print(f"\n{name} - Train Performance: {perf_train}")
y_pred_train = model.predict(X_train)
cm_train = confusion_matrix(y_train, y_pred_train)
disp_train = ConfusionMatrixDisplay(confusion_matrix=cm_train, display_labels=model.classes_)
disp_train.plot()
plt.title(f'{name} - Confusion Matrix on Original Train Data')
plt.show()
# Evaluating and displaying metrics on test data
perf_test = model_performance_classification_sklearn(model, X_test, y_test)
print(f"{name} - Test Performance: {perf_test}")
y_pred_test = model.predict(X_test)
cm_test = confusion_matrix(y_test, y_pred_test)
disp_test = ConfusionMatrixDisplay(confusion_matrix=cm_test, display_labels=model.classes_)
disp_test.plot()
plt.title(f'{name} - Confusion Matrix on Test Data')
plt.show()
| Model | Data | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|---|
| DecisionTree | Train | 1.000 | 1.000 | 1.000 | 1.000 |
| DecisionTree | Test | 0.936 | 0.782 | 0.812 | 0.796 |
| Bagging | Train | 0.996 | 0.978 | 0.996 | 0.987 |
| Bagging | Test | 0.952 | 0.778 | 0.907 | 0.838 |
| RandomForest | Train | 1.000 | 1.000 | 1.000 | 1.000 |
| RandomForest | Test | 0.956 | 0.766 | 0.947 | 0.847 |
| AdaBoost | Train | 0.961 | 0.849 | 0.901 | 0.874 |
| AdaBoost | Test | 0.954 | 0.800 | 0.903 | 0.848 |
| GradientBoost | Train | 0.976 | 0.892 | 0.956 | 0.923 |
| GradientBoost | Test | 0.964 | 0.815 | 0.953 | 0.879 |
| XGBoost | Train | 1.000 | 1.000 | 1.000 | 1.000 |
| XGBoost | Test | 0.969 | 0.855 | 0.949 | 0.900 |
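The model ranking above rests on a single train/test split, which can be noisy. A sketch of how the comparison could be made more robust with stratified cross-validation (synthetic data stands in for the bank's X_train / y_train here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced synthetic data, roughly the churn/non-churn ratio of this dataset
X_demo, y_demo = make_classification(
    n_samples=500, weights=[0.84, 0.16], random_state=1
)

# Stratified folds keep the class ratio consistent in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(
    RandomForestClassifier(random_state=1), X_demo, y_demo,
    cv=skf, scoring="recall",
)
print(scores.mean(), scores.std())  # mean recall and its spread across folds
```

Reporting the fold-to-fold spread alongside the mean makes it clearer whether one model's edge over another is real.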
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
import xgboost as xgb
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
import matplotlib.pyplot as plt
# Synthetic Minority Oversampling Technique (SMOTE)
from imblearn.over_sampling import SMOTE

sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
# Defining the models
dt = DecisionTreeClassifier(random_state=1)
bag = BaggingClassifier(random_state=1)
rf = RandomForestClassifier(random_state=1)
ada = AdaBoostClassifier(random_state=1)
gb = GradientBoostingClassifier(random_state=1)
xg = xgb.XGBClassifier(random_state=1, use_label_encoder=False, eval_metric='logloss')
# Appending the models to a list
models = []
models.append(('DecisionTree', dt))
models.append(('Bagging', bag))
models.append(('RandomForest', rf))
models.append(('AdaBoost', ada))
models.append(('GradientBoost', gb))
models.append(('XGBoost', xg))
# Looping through models to train, predict, and evaluate
for name, model in models:
# Fitting the model
model.fit(X_train_over, y_train_over)
# Evaluating and displaying metrics on training data
perf_train = model_performance_classification_sklearn(model, X_train_over, y_train_over)
print(f"\n{name} - Train Performance: {perf_train}")
y_pred_train_over = model.predict(X_train_over)
cm_train = confusion_matrix(y_train_over, y_pred_train_over)
disp_train = ConfusionMatrixDisplay(confusion_matrix=cm_train, display_labels=model.classes_)
disp_train.plot()
plt.title(f'{name} - Confusion Matrix on Oversampled Train Data')
plt.show()
# Evaluating and displaying metrics on test data
perf_test = model_performance_classification_sklearn(model, X_test, y_test)
print(f"{name} - Test Performance: {perf_test}")
y_pred_test = model.predict(X_test)
cm_test = confusion_matrix(y_test, y_pred_test)
disp_test = ConfusionMatrixDisplay(confusion_matrix=cm_test, display_labels=model.classes_)
disp_test.plot()
plt.title(f'{name} - Confusion Matrix on Test Data')
plt.show()
| Model | Data | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|---|
| DecisionTree | Train | 1.000 | 1.000 | 1.000 | 1.000 |
| DecisionTree | Test | 0.912 | 0.809 | 0.692 | 0.746 |
| Bagging | Train | 0.998 | 0.998 | 0.999 | 0.998 |
| Bagging | Test | 0.942 | 0.840 | 0.808 | 0.824 |
| RandomForest | Train | 1.000 | 1.000 | 1.000 | 1.000 |
| RandomForest | Test | 0.945 | 0.815 | 0.836 | 0.826 |
| AdaBoost | Train | 0.962 | 0.965 | 0.959 | 0.962 |
| AdaBoost | Test | 0.938 | 0.846 | 0.786 | 0.815 |
| GradientBoost | Train | 0.974 | 0.974 | 0.974 | 0.974 |
| GradientBoost | Test | 0.955 | 0.868 | 0.855 | 0.861 |
| XGBoost | Train | 1.000 | 1.000 | 1.000 | 1.000 |
| XGBoost | Test | 0.965 | 0.865 | 0.912 | 0.888 |
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
import xgboost as xgb
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
import matplotlib.pyplot as plt
# Random undersampler for undersampling the data
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
# Defining the models
dt = DecisionTreeClassifier(random_state=1)
bag = BaggingClassifier(random_state=1)
rf = RandomForestClassifier(random_state=1)
ada = AdaBoostClassifier(random_state=1)
gb = GradientBoostingClassifier(random_state=1)
xg = xgb.XGBClassifier(random_state=1, use_label_encoder=False, eval_metric='logloss')
# Appending the models to a list
models = []
models.append(('DecisionTree', dt))
models.append(('Bagging', bag))
models.append(('RandomForest', rf))
models.append(('AdaBoost', ada))
models.append(('GradientBoost', gb))
models.append(('XGBoost', xg))
# Looping through models to train, predict, and evaluate
for name, model in models:
# Fitting the model
model.fit(X_train_un, y_train_un)
# Evaluating and displaying metrics on training data
perf_train = model_performance_classification_sklearn(model, X_train_un, y_train_un)
print(f"\n{name} - Train Performance: {perf_train}")
y_pred_train_un = model.predict(X_train_un)
cm_train = confusion_matrix(y_train_un, y_pred_train_un)
disp_train = ConfusionMatrixDisplay(confusion_matrix=cm_train, display_labels=model.classes_)
disp_train.plot()
plt.title(f'{name} - Confusion Matrix on Undersampled Train Data')
plt.show()
# Evaluating and displaying metrics on test data
perf_test = model_performance_classification_sklearn(model, X_test, y_test)
print(f"{name} - Test Performance: {perf_test}")
y_pred_test = model.predict(X_test)
cm_test = confusion_matrix(y_test, y_pred_test)
disp_test = ConfusionMatrixDisplay(confusion_matrix=cm_test, display_labels=model.classes_)
disp_test.plot()
plt.title(f'{name} - Confusion Matrix on Test Data')
plt.show()
| Model | Data | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|---|
| DecisionTree | Train | 1.000 | 1.000 | 1.000 | 1.000 |
| DecisionTree | Test | 0.904 | 0.905 | 0.643 | 0.752 |
| Bagging | Train | 0.995 | 0.992 | 0.998 | 0.995 |
| Bagging | Test | 0.940 | 0.920 | 0.757 | 0.831 |
| RandomForest | Train | 1.000 | 1.000 | 1.000 | 1.000 |
| RandomForest | Test | 0.929 | 0.902 | 0.725 | 0.804 |
| AdaBoost | Train | 0.948 | 0.958 | 0.940 | 0.949 |
| AdaBoost | Test | 0.933 | 0.942 | 0.725 | 0.819 |
| GradientBoost | Train | 0.979 | 0.979 | 0.979 | 0.979 |
| GradientBoost | Test | 0.948 | 0.938 | 0.782 | 0.853 |
| XGBoost | Train | 1.000 | 1.000 | 1.000 | 1.000 |
| XGBoost | Test | 0.951 | 0.929 | 0.797 | 0.858 |
Summary of Model Performances
1. Using Original Data:
XGBoost showed the best performance, with a Recall of 0.855 on the test data.
2. Using Oversampled Data:
GradientBoost showed the best performance, with a Recall of 0.868 on the test data.
3. Using Undersampled Data:
AdaBoost showed the best performance, with a Recall of 0.942 on the test data.
Next, we will tune the hyperparameters of the most promising models (the boosting ensembles and Random Forest) with RandomizedSearchCV:
import numpy as np
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
# Define parameter grids
param_grids = {
'XGBClassifier': {
'n_estimators': np.arange(50, 300, 50),
'scale_pos_weight': [1, 2, 5, 10],  # values > 1 upweight the minority (churn) class; 0 would zero out the positive-class gradient
'learning_rate': [0.01, 0.1, 0.2, 0.05],
'gamma': [0, 1, 3, 5],
'subsample': [0.7, 0.8, 0.9, 1]
},
'GradientBoostingClassifier': {
'init': [AdaBoostClassifier(random_state=1), DecisionTreeClassifier(random_state=1)],
'n_estimators': np.arange(75, 150, 25),
'learning_rate': [0.1, 0.01, 0.2, 0.05, 1],
'subsample': [0.5, 0.7, 1],
'max_features': [0.5, 0.7, 1]
},
'AdaBoostClassifier': {
'n_estimators': np.arange(10, 110, 10),
'learning_rate': [0.1, 0.01, 0.2, 0.05, 1],
'base_estimator': [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1)
]
},
'RandomForestClassifier': {
'n_estimators': [200, 250, 300],
'min_samples_leaf': np.arange(1, 4),
'max_features': list(np.arange(0.3, 0.6, 0.1)) + ['sqrt'],  # each candidate must be a scalar, not an array
'max_samples': np.arange(0.4, 0.7, 0.1)
}
}
# Define models
models = {
    'XGBClassifier': XGBClassifier(random_state=1, use_label_encoder=False, eval_metric='logloss'),
    'GradientBoostingClassifier': GradientBoostingClassifier(random_state=1),
    'AdaBoostClassifier': AdaBoostClassifier(random_state=1),
    'RandomForestClassifier': RandomForestClassifier(random_state=1)
}
# Placeholder for model results
results = {}
# Perform Randomized Search
for model_name, model in models.items():
rs = RandomizedSearchCV(
model,
param_distributions=param_grids[model_name],
n_iter=50,
scoring='recall',
cv=10,
n_jobs=-1,
random_state=1
)
rs.fit(X_train, y_train)
results[model_name] = {
'best_params': rs.best_params_,
'best_score': rs.best_score_,
'best_estimator': rs.best_estimator_
}
# Display results
display(results)
{'XGBClassifier': {'best_params': {'subsample': 0.7,
'scale_pos_weight': 10,
'n_estimators': 150,
'learning_rate': 0.05,
'gamma': 3},
'best_score': 0.9523722842043452,
'best_estimator': XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric='logloss',
feature_types=None, gamma=3, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=0.05, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, multi_strategy=None, n_estimators=150,
n_jobs=None, num_parallel_tree=None, random_state=None, ...)},
'GradientBoostingClassifier': {'best_params': {'subsample': 0.7,
'n_estimators': 125,
'max_features': 0.7,
'learning_rate': 0.2,
'init': AdaBoostClassifier(random_state=1)},
'best_score': 0.8740575455079271,
'best_estimator': GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
learning_rate=0.2, max_features=0.7,
n_estimators=125, subsample=0.7)},
'AdaBoostClassifier': {'best_params': {'n_estimators': 90,
'learning_rate': 0.2,
'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)},
'best_score': 0.8848032883147386,
'best_estimator': AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.2, n_estimators=90)},
'RandomForestClassifier': {'best_params': {'n_estimators': 300,
'min_samples_leaf': 1,
'max_samples': 0.6,
'max_features': 'sqrt'},
'best_score': 0.7565413975337639,
'best_estimator': RandomForestClassifier(max_samples=0.6, n_estimators=300)}}
The output provides the best parameters found by RandomizedSearchCV for four different classifiers: XGBClassifier, GradientBoostingClassifier, AdaBoostClassifier, and RandomForestClassifier, along with their respective best recall scores obtained during cross-validation.
1. XGBClassifier
Observations: The XGBClassifier achieved the highest recall score among the four models. The scale_pos_weight parameter, which controls the balance of positive and negative weights, is set to 10, indicating a strategy to combat the imbalance in the dataset by giving higher importance to the minority class.
2. GradientBoostingClassifier
Observations: The GradientBoostingClassifier achieved a good recall score, though not as high as the XGBClassifier. The init parameter uses an AdaBoostClassifier as the initial estimator, which gives the boosting stages a stronger starting point than the default log-odds prior and may help the model focus on misclassified examples.
3. AdaBoostClassifier
Observations: The AdaBoostClassifier also performed well. The base estimator is a decision tree with a max depth of 3, which means the model uses slightly more complex weak learners than the default stump (max depth of 1).
4. RandomForestClassifier
Observations: The RandomForestClassifier has the lowest recall among the four models. The max_features parameter is set to 'sqrt', meaning that each tree in the forest is allowed to choose from the square root of the total number of features when splitting a node.
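The tuned scale_pos_weight of 10 is in line with a common heuristic from the XGBoost documentation, count(negative) / count(positive). A small sketch with a class ratio similar to this dataset's (the exact counts below are illustrative, not the actual dataset's):

```python
import numpy as np

# Illustrative label vector: ~16% churners, similar to this dataset
y_demo = np.array([0] * 840 + [1] * 160)
neg, pos = np.bincount(y_demo)

# XGBoost's suggested starting point for imbalanced binary targets
print(neg / pos)  # 5.25 -- same order of magnitude as the tuned value of 10
```

That the search preferred an even larger value than the heuristic suggests is consistent with recall being the optimization target: heavier positive weighting trades precision for fewer missed churners.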
import matplotlib.pyplot as plt
import pandas as pd
# Best cross-validated recall scores from RandomizedSearchCV (rounded)
recall_scores = {
    'XGBClassifier': 0.952,
    'GradientBoostingClassifier': 0.874,
    'AdaBoostClassifier': 0.885,
    'RandomForestClassifier': 0.757
}
# Convert to DataFrame for visualization
recall_df = pd.DataFrame(list(recall_scores.items()), columns=['Model', 'Recall Score'])
# Plotting
plt.figure(figsize=(10, 6))
plt.barh(recall_df['Model'], recall_df['Recall Score'], color=['skyblue', 'orange', 'green', 'red'])
plt.xlabel('Recall Score')
plt.title('Recall Score Comparison of Tuned Models')
plt.xlim(0, 1)
plt.grid(axis='x')
# Adding the score labels
for index, value in enumerate(recall_df['Recall Score']):
plt.text(value, index, f'{value:.3f}', va='center')
plt.show()
1. XGBClassifier:
2. GradientBoostingClassifier:
3. AdaBoostClassifier:
Conclusion:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# Using the best XGBoost estimator found by RandomizedSearchCV
# (it is already refit on the full training data by the search)
xg_tuned = results['XGBClassifier']['best_estimator']
# Predict the classes
y_pred = xg_tuned.predict(X_test)
# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Display the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Not Churn', 'Churn'])
fig, ax = plt.subplots(figsize=(8, 8))
disp.plot(ax=ax, cmap='Blues', values_format='.0f')
plt.title('Confusion Matrix for Tuned XGBoost on Test Set')
plt.show()
True Positives (TP): the model correctly identified 278 churning customers.
True Negatives (TN): the model correctly predicted 1686 instances where customers did not churn, a strong performance in identifying non-churn cases.
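The confusion-matrix cells map directly onto the evaluation metrics. A sketch using the two known counts from the matrix above; the FP / FN values below are hypothetical placeholders for illustration, not the actual model's counts:

```python
tp, tn = 278, 1686   # from the tuned model's confusion matrix
fp, fn = 30, 32      # hypothetical values, for illustration only

recall = tp / (tp + fn)       # share of true churners the model caught
precision = tp / (tp + fp)    # share of predicted churners who really churn
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(recall, 3), round(precision, 3), round(accuracy, 3))  # 0.897 0.903 0.969
```

Reading the metrics off the matrix this way makes the recall/precision trade-off concrete: every customer moved from FN to TP raises recall, even if a few extra FPs lower precision.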
1. Model Performance:
2. Feature Importance:
3. Customer Churn Insights:
1. Customer Engagement:
2. Credit Management:
3. Customer Feedback:
4. Loyalty Programs:
5. Customer Support:
1. Model Deployment:
2. Continuous Model Monitoring:
3. Further Analysis:
4. Customer Segmentation:
5. Model Explainability:
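The feature-importance step referenced above could be sketched as follows; a RandomForest on synthetic data stands in for the tuned XGBoost here, since the feature_importances_ attribute works the same way on both.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with named columns (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=1)
cols = [f"feature_{i}" for i in range(5)]

model = RandomForestClassifier(random_state=1).fit(X_demo, y_demo)

# Importances sum to 1; sorting puts the strongest predictors first
importances = (
    pd.Series(model.feature_importances_, index=cols)
    .sort_values(ascending=False)
)
print(importances)
```

On the real model, the top-ranked columns (e.g. transaction counts or utilization) would be the natural starting point for the retention recommendations above.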
Addressing the problem statement, the XGBoost model has provided a robust solution to predict customer churn, enabling Thera Bank to proactively address customer retention. By implementing the above recommendations and continuously refining the model, the bank can significantly enhance its customer retention strategies and reduce churn effectively.